NSF PAR Search | NSF Public Access Repository

ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets

https://doi.org/10.21437/Interspeech.2024-2248

Shi, Jiatong; Wang, Shih-Heng; Chen, William; Bartelds, Martijn; Bannihatti Kumar, Vanya; Tian, Jinchuan; Chang, Xuankai; Jurafsky, Dan; Livescu, Karen; Lee, Hung-yi; et al. (September 2024, ISCA)

Full Text Available
Towards Robust Speech Representation Learning for Thousands of Languages

https://doi.org/10.18653/v1/2024.emnlp-main.570

Chen, William; Zhang, Wangyou; Peng, Yifan; Li, Xinjian; Tian, Jinchuan; Shi, Jiatong; Chang, Xuankai; Maiti, Soumi; Livescu, Karen; Watanabe, Shinji (January 2024, Association for Computational Linguistics)

Full Text Available
Audio-Visual Neural Syntax Acquisition

Lai, Cheng-I Jeff; Shi, Freda; Peng, Puyuan; Kim, Yoon; Gimpel, Kevin; Chang, Shiyu; Chuang, Yung-Sung; Bhati, Saurabhchand; Cox, David; Harwath, David; et al. (December 2023, IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU))

Full Text Available
Chess as a Testbed for Language Model State Tracking

https://doi.org/10.1609/aaai.v36i10.21390

Toshniwal, Shubham; Wiseman, Sam; Livescu, Karen; Gimpel, Kevin (June 2022, Proceedings of the AAAI Conference on Artificial Intelligence)

Transformer language models have made tremendous strides in natural language understanding tasks. However, the complexity of natural language makes it challenging to ascertain how accurately these models are tracking the world state underlying the text. Motivated by this issue, we consider the task of language modeling for the game of chess. Unlike natural language, chess notations describe a simple, constrained, and deterministic domain. Moreover, we observe that the appropriate choice of chess notation allows for directly probing the world state, without requiring any additional probing-related machinery. We find that: (a) With enough training data, transformer language models can learn to track pieces and predict legal moves with high accuracy when trained solely on move sequences. (b) For small training sets, providing access to board state information during training can yield significant improvements. (c) The success of transformer language models depends on access to the entire game history, i.e., “full attention”. Approximating this full attention results in a significant performance drop. We propose this testbed as a benchmark for future work on the development and analysis of transformer language models.
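To make the idea of probing board state through move notation concrete, here is a minimal sketch. It assumes a UCI-style notation (start square followed by end square, e.g. "e2e4"); the abstract does not name the notation, so that choice is an assumption, and every function name below is hypothetical. The sketch only computes the ground-truth answer a probe would be scored against. It does not involve a language model, and it ignores castling, en passant, and promotion.

```python
# Hypothetical sketch: replay UCI-style moves to recover which piece sits on
# each square. Simplification: castling, en passant, and promotion are NOT
# handled, so this is only valid for games that avoid those moves.

def initial_board():
    """Map a square name (e.g. 'e2') to a piece letter; uppercase is White."""
    board = {}
    back_rank = "RNBQKBNR"
    for i, file_letter in enumerate("abcdefgh"):
        board[file_letter + "1"] = back_rank[i]
        board[file_letter + "2"] = "P"
        board[file_letter + "7"] = "p"
        board[file_letter + "8"] = back_rank[i].lower()
    return board


def tokenize(move):
    """Split a move such as 'e2e4' into start-square and end-square tokens."""
    return [move[:2], move[2:4]]


def replay(moves):
    """Apply each move in order and return the resulting board."""
    board = initial_board()
    for move in moves:
        src, dst = tokenize(move)
        board[dst] = board.pop(src)  # a capture simply overwrites the target
    return board


def probe_target(moves, square):
    """Ground truth for a probe: the piece on `square` after `moves`, or None."""
    return replay(moves).get(square)


game = ["e2e4", "e7e5", "g1f3", "b8c6", "f1b5"]
print(probe_target(game, "b5"))  # B (the White bishop that left f1)
print(probe_target(game, "f1"))  # None (the square is now empty)
```

Because each move begins with a start-square token, a model prompted with a game prefix plus a start square has to know which piece stands there in order to continue legally, which is why a notation of this shape can double as a probe.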
Full Text Available
Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings

https://doi.org/10.1109/SLT48900.2021.9383578

Shi, Bowen; Settle, Shane; Livescu, Karen (January 2021, IEEE Workshop on Spoken Language Technology)

Segmental models are sequence prediction models in which scores of hypotheses are based on entire variable-length segments of frames. We consider segmental models for whole-word ("acoustic-to-word") speech recognition, with the feature vectors defined using vector embeddings of segments. Such models are computationally challenging as the number of paths is proportional to the vocabulary size, which can be orders of magnitude larger than when using subword units like phones. We describe an efficient approach for end-to-end whole-word segmental models, with forward-backward and Viterbi decoding performed on a GPU and a simple segment scoring function that reduces space complexity. In addition, we investigate the use of pre-training via jointly trained acoustic word embeddings (AWEs) and acoustically grounded word embeddings (AGWEs) of written word labels. We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation with AWEs, and additional (smaller) gains can be obtained by pre-training the word prediction layer with AGWEs. Our final models improve over prior A2W models.
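The decoding problem described above can be illustrated with a small sketch of segmental Viterbi search. This is a plain-Python CPU illustration, not the GPU implementation the abstract refers to; the function `segmental_viterbi`, the `max_len` cap on segment length, and the toy scoring function are all assumptions made for the example.

```python
def segmental_viterbi(num_frames, vocab, score, max_len):
    """Find the best-scoring segmentation of frames [0, num_frames).

    score(start, end, word) is a hypothetical segment scoring function over
    frames [start, end). Returns (best_score, [(start, end, word), ...]).
    """
    neg_inf = float("-inf")
    best = [neg_inf] * (num_frames + 1)  # best[t]: best score of a path ending at frame t
    back = [None] * (num_frames + 1)     # back[t]: (start, word) of the last segment
    best[0] = 0.0
    for end in range(1, num_frames + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] == neg_inf:
                continue
            for word in vocab:  # inner loop cost grows linearly with vocabulary size
                cand = best[start] + score(start, end, word)
                if cand > best[end]:
                    best[end] = cand
                    back[end] = (start, word)
    # Follow back-pointers from the final frame to recover the segments.
    segments = []
    t = num_frames
    while t > 0:
        start, word = back[t]
        segments.append((start, t, word))
        t = start
    return best[num_frames], segments[::-1]


# Toy data: each "frame" is a letter, and a segment scores +1 per matching
# frame, -1 per mismatching frame, minus a 0.5 penalty per segment.
frames = "aabbb"

def toy_score(start, end, word):
    matches = sum(1 if frames[i] == word else -1 for i in range(start, end))
    return matches - 0.5

result = segmental_viterbi(len(frames), ["a", "b"], toy_score, max_len=3)
print(result)  # (4.0, [(0, 2, 'a'), (2, 5, 'b')])
```

The triple loop makes the cost visible: every (start, end) pair is scored against every vocabulary entry, so a whole-word vocabulary multiplies the work relative to a small phone inventory.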
Full Text Available
Acoustic Span Embeddings for Multilingual Query-by-Example Search

https://doi.org/10.1109/SLT48900.2021.9383545

Hu, Yushi; Settle, Shane; Livescu, Karen (January 2021, IEEE Workshop on Spoken Language Technology)

Query-by-example (QbE) speech search is the task of matching spoken queries to utterances within a search collection. In low- or zero-resource settings, QbE search is often addressed with approaches based on dynamic time warping (DTW). Recent work has found that methods based on acoustic word embeddings (AWEs) can improve both performance and search speed. However, prior work on AWE-based QbE has primarily focused on English data and single-word queries. In this work, we generalize AWE training to spans of words, producing acoustic span embeddings (ASE), and explore the application of ASE to QbE with arbitrary-length queries in multiple unseen languages. We consider the commonly used setting where we have access to labeled data in other languages (in our case, several low-resource languages) distinct from the unseen test languages. We evaluate our approach on the QUESST 2015 QbE tasks, finding that multilingual ASE-based search is much faster than DTW-based search and outperforms the best previously published results on this task.
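For readers unfamiliar with the DTW baseline mentioned above, the following is a minimal sketch of classic dynamic time warping between two sequences. It aligns the full query to the full utterance, whereas QbE systems typically use subsequence variants; the function name `dtw_cost`, the scalar "frames", and the absolute-difference distance are assumptions chosen to keep the example short.

```python
def dtw_cost(query, utterance, dist=lambda a, b: abs(a - b)):
    """Return the minimum cumulative alignment cost between two sequences.

    Uses the standard recurrence with three allowed steps: advance the query,
    advance the utterance, or advance both.
    """
    n, m = len(query), len(utterance)
    inf = float("inf")
    # cost[i][j]: best cost of aligning query[:i] with utterance[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(query[i - 1], utterance[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # query frame i also matches frame j
                cost[i][j - 1],      # utterance frame j also matches frame i
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


print(dtw_cost([1, 2, 3], [1, 2, 2, 3]))  # 0.0 (same shape, stretched in time)
print(dtw_cost([1, 2, 3], [2, 3, 4]))     # 2.0
```

The table has one cell per (query frame, utterance frame) pair, so each comparison costs time proportional to the product of the two lengths. Comparing two fixed-dimensional embeddings instead needs a single vector distance, which is the source of the speed advantage of embedding-based search.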
Full Text Available
Layer-Wise Analysis of a Self-Supervised Speech Representation Model

Pasad, Ankita; Chou, Ju-Chieh; Livescu, Karen (January 2021, IEEE Automatic Speech Recognition and Understanding Workshop - ASRU 2021)

Recently proposed self-supervised learning approaches have been successful for pre-training speech representation models. The utility of these learned representations has been observed empirically, but little is known about the type or extent of information encoded in the pre-trained representations themselves. Developing such insights can help us understand the capabilities and limits of these models and enable the research community to use them more efficiently in downstream applications. In this work, we begin to fill this gap by examining one recent and successful pre-trained model (wav2vec 2.0), via its intermediate representation vectors, using a suite of analysis tools. We use the metrics of canonical correlation, mutual information, and performance on simple downstream tasks with non-parametric probes, in order to (i) query for acoustic and linguistic information content, (ii) characterize the evolution of information across model layers, and (iii) understand how fine-tuning the model for automatic speech recognition (ASR) affects these observations. Our findings motivate modifying the fine-tuning protocol for ASR, which produces improved word error rates in a low-resource setting.
Full Text Available
PeTra: A Sparsely Supervised Memory Model for People Tracking

https://doi.org/10.18653/v1/2020.acl-main.481

Toshniwal, Shubham; Ettinger, Allyson; Gimpel, Kevin; Livescu, Karen (July 2020, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics)

We propose PeTra, a memory-augmented neural network designed to track entities in its memory slots. PeTra is trained using sparse annotation from the GAP pronoun resolution dataset and outperforms a prior memory model on the task while using a simpler architecture. We empirically compare key modeling choices, finding that we can simplify several aspects of the design of the memory module while retaining strong performance. To measure the people tracking capability of memory models, we (a) propose a new diagnostic evaluation based on counting the number of unique entities in text, and (b) conduct a small-scale human evaluation to compare evidence of people tracking in the memory logs of PeTra relative to a previous approach. PeTra is highly effective in both evaluations, demonstrating its ability to track people in its memory despite being trained with limited annotation.
Full Text Available
Multilingual Jointly Trained Acoustic and Written Word Embeddings

Hu, Yushi; Settle, Shane; Livescu, Karen (January 2020, Interspeech)

Full Text Available
Unsupervised Pre-training of Bidirectional Speech Encoders via Masked Reconstruction

Wang, Weiran; Tang, Qingming; Livescu, Karen (January 2020, Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing)

Full Text Available
